
    CycleCounter: an Efficient and Accurate UltraSPARC III CPU Simulation Module

    This paper presents a novel technique for cycle-accurate simulation of the Central Processing Unit (CPU) of a modern superscalar processor, the UltraSPARC III Cu. The technique is based on adding a module to an existing fetch-decode-execute style of CPU simulator, rather than the traditional method of fully implementing the CPU pipeline and microarchitecture. The module's main functions are simulating instruction grouping, register interlocks and the store buffer; it has a simple table-driven implementation which permits easy modification for exploring microarchitectural variations. The technique results in a 15-30% loss of simulation speed, instead of the 10× or greater performance loss incurred by fully implementing the detailed microarchitecture. The accuracy of the technique is validated against an actual UltraSPARC III Cu processor, and it achieves high levels of accuracy in cases of interest.
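
    The grouping idea can be illustrated with a minimal, self-contained sketch. This is not the paper's CycleCounter module: the instruction classes, per-group limits and 4-wide group size below are assumptions chosen only to show what a table-driven grouping simulation looks like.

        # Hypothetical table-driven grouping simulator (illustrative only;
        # the classes and limits below are assumed, not UltraSPARC III's).
        GROUP_LIMITS = {"alu": 2, "load_store": 1, "branch": 1, "fp": 2}
        GROUP_WIDTH = 4  # instructions dispatched per group

        def count_groups(instrs):
            """Count dispatch groups (~cycles) for a list of class labels."""
            groups, used, width = 0, {}, 0
            for cls in instrs:
                # Close the current group when it is full or this class
                # has exhausted its per-group issue slots.
                if width >= GROUP_WIDTH or used.get(cls, 0) >= GROUP_LIMITS[cls]:
                    groups += 1
                    used, width = {}, 0
                used[cls] = used.get(cls, 0) + 1
                width += 1
            return groups + (1 if width else 0)

        print(count_groups(["alu", "alu", "load_store", "branch",
                            "fp", "load_store", "alu"]))  # -> 2

    A full module in this style would also consult similar tables for register interlocks and store-buffer occupancy.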

    Accelerated methods for performing the LDLT decomposition

    This paper describes the design, implementation and performance of parallel direct dense symmetric-indefinite matrix factorisation algorithms. These algorithms use the Bunch-Kaufman diagonal pivoting method. The starting point is numerically identical to the LAPACK _sytrf() algorithm, but outperforms zsytrf() by approximately 15% for large matrices on the UltraSPARC family of processors. The first variant reduces symmetric interchanges, which is particularly important for parallel implementation, by taking into account the growth attained by any preceding columns that did not require interchanges; it nevertheless achieves the same growth bound. The second variant uses a lookahead technique with heuristic methods to predict whether interchanges are required over the next block column; if so, the block column can be eliminated using modified Cholesky methods, which can yield both computational and communication advantages. These algorithms yield the best performance gains on 'weakly indefinite' matrices (i.e. those whose diagonal elements are generally large), which often arise from electromagnetic field analysis applications. On UltraSPARC processors, the first variant generally achieves a 1-2% performance gain; the second is faster still for large matrices, by 2% for complex double precision and 6% for double precision. However, larger performance gains are observed on distributed memory machines, where symmetric interchanges are relatively more expensive. On a 16-node 300 MHz UltraSPARC-based Fujitsu AP3000, the first variant achieved a 10-15% improvement for small to moderate-sized matrices, decreasing to 7% for large matrices. For N = 10000, it achieved a sustained speed of 5.6 GFLOPs and a parallel speedup of 12.8.
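
    For context, the factorisation being accelerated can be computed with SciPy's wrapper around the same LAPACK ?sytrf (Bunch-Kaufman) routine family. The snippet below is a generic illustration of LDLT with diagonal pivoting, not the paper's accelerated variants:

        # Generic Bunch-Kaufman LDL^T via SciPy (illustrative only).
        import numpy as np
        from scipy.linalg import ldl

        rng = np.random.default_rng(0)
        A = rng.standard_normal((6, 6))
        A = A + A.T                         # symmetric, generally indefinite

        L, D, perm = ldl(A, lower=True)     # D has 1x1 and 2x2 pivot blocks
        print(np.allclose(L @ D @ L.T, A))  # True: A = L D L^T

    The 2x2 blocks in D arise where a single diagonal element is too small to pivot on safely, which is exactly when Bunch-Kaufman performs the symmetric interchanges the first variant above seeks to reduce.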

    Investigating Decision Support Techniques for Automating Cloud Service Selection

    The range of Cloud infrastructure services expands steadily, leaving users with an agony of choice. To select the best mix of service offerings from an abundance of possibilities, users must consider complex dependencies and heterogeneous sets of criteria. We therefore present a PhD thesis proposal on investigating an intelligent decision support system for selecting Cloud-based infrastructure services (e.g. storage, network, CPU).
    Comment: Accepted by IEEE Cloudcom 2012 - PhD consortium track

    Optimal load balancing techniques for block-cyclic decompositions for matrix factorization

    In this paper, we present a new load balancing technique, called panel scattering, which is generally applicable to parallel block-partitioned dense linear algebra algorithms, such as matrix factorization. Here, the panels formed in such computations are divided across their length and evenly (re-)distributed among all processors. It is shown how this technique can be efficiently implemented for the general block-cyclic matrix distribution, requiring only the collective communication primitives that are required for block-cyclic parallel BLAS. In most situations, panel scattering yields optimal load balance and cell computation speed across all stages of the computation. It also has the advantage of naturally yielding good memory access patterns. Compared with traditional methods, which minimize communication costs at the expense of load balance, it incurs a small (in some situations negative) increase in communication volume costs. It does, however, incur extra communication startup costs, but only by a factor not exceeding 2. To maximize load balance and minimize the cost of panel re-distribution, storage block sizes should be kept small; furthermore, in many situations of interest, there will be no significant communication startup penalty for doing so. Results are given on the Fujitsu AP+ parallel computer, comparing the performance of panel scattering with previously established methods for LU, LLT and QR factorization. These are consistent with a detailed performance model for LU factorization, developed here for each method.
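
    A toy sketch may help contrast the two distributions (serial, 1-D and with made-up names for brevity): under a block-cyclic map a panel's blocks belong to only some processes, whereas panel scattering re-divides the panel's length evenly over all of them before elimination:

        # Toy contrast of block-cyclic ownership vs. panel scattering
        # (illustrative only; real implementations are 2-D and parallel).
        import numpy as np

        def cyclic_owner(block, P):
            """Owner of a storage block under a 1-D block-cyclic map."""
            return block % P

        def scatter_panel(panel_rows, P):
            """Split a panel's rows into P near-equal contiguous pieces."""
            return np.array_split(np.arange(panel_rows), P)

        P = 4
        print([cyclic_owner(b, P) for b in range(8)])  # [0, 1, 2, 3, 0, 1, 2, 3]
        for rank, rows in enumerate(scatter_panel(10, P)):
            print(f"process {rank} eliminates rows {rows.tolist()}")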

    A dense complex symmetric indefinite solver for the Fujitsu AP3000


    City Data Fusion: Sensor Data Fusion in the Internet of Things

    The Internet of Things (IoT) has gained substantial attention recently and plays a significant role in smart city application deployments. A number of such smart city applications depend on sensor fusion capabilities in the cloud from diverse data sources. We introduce the concept of IoT and present in detail ten different parameters that govern our sensor data fusion evaluation framework. We then evaluate the current state of the art in sensor data fusion against this framework. Our main goal is to examine and survey different sensor data fusion research efforts based on our evaluation framework. The major open research issues related to sensor data fusion are also presented.
    Comment: Accepted to be published in International Journal of Distributed Systems and Technologies (IJDST), 201

    An infrastructure service recommendation system for cloud applications with real-time QoS requirement constraints

    The proliferation of cloud computing has revolutionized the hosting and delivery of Internet-based application services. However, with the constant launch of new cloud services and capabilities almost every month by both big companies (e.g., Amazon Web Services and Microsoft Azure) and small ones (e.g., Rackspace and Ninefold), decision makers (e.g., application developers and chief information officers) are likely to be overwhelmed by the choices available. The decision-making problem is further complicated by heterogeneous service configurations and application provisioning QoS constraints. To address this hard challenge, in our previous work we developed a semiautomated, extensible, and ontology-based approach to infrastructure service discovery and selection based only on design-time constraints (e.g., renting cost, data center location, service features). In this paper, we extend our approach to include real-time (run-time) QoS (end-to-end message latency and end-to-end message throughput) in the decision-making process. Hosting next-generation applications in the domains of online interactive gaming, large-scale sensor analytics, and real-time mobile applications on cloud services necessitates optimizing such real-time QoS constraints to meet service-level agreements. To this end, we present a real-time QoS-aware multicriteria decision-making technique that builds on the well-known analytic hierarchy process (AHP) method. The proposed technique is applicable to selecting Infrastructure as a Service (IaaS) cloud offers, and it allows users to define multiple design-time and real-time QoS constraints or requirements. These requirements are then matched against our knowledge base to compute the best-fit combinations of cloud services at the IaaS layer. We conducted extensive experiments to demonstrate the feasibility of our approach.
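
    The AHP step at the core of such a technique is standard; the fragment below shows the generic weight computation (the criteria names and pairwise judgments are made up, and this is not the paper's knowledge-base matching):

        # Generic AHP: priority weights from a pairwise comparison matrix.
        import numpy as np

        criteria = ["cost", "latency", "throughput"]   # assumed criteria
        # A[i, j]: preference of criterion i over j on Saaty's 1-9 scale.
        A = np.array([[1.0, 3.0, 5.0],
                      [1/3, 1.0, 2.0],
                      [1/5, 1/2, 1.0]])

        vals, vecs = np.linalg.eig(A)
        w = np.abs(vecs[:, vals.real.argmax()].real)
        w /= w.sum()                    # normalised priority weights
        for name, weight in zip(criteria, w):
            print(f"{name}: {weight:.3f}")

    Candidate IaaS offers would then be scored against each criterion and ranked by the weighted sum.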

    Profiling Methodology and Performance Tuning of the Met Office Unified Model for Weather and Climate Simulations

    Global weather and climate modelling is a compute-intensive task that is mission-critical to government departments concerned with meteorology and climate change. The dominant component of these models is a global atmosphere model. One such model, the Met Office Unified Model (MetUM), is widely used in both Europe and Australia for this purpose. This paper describes our experiences in developing an efficient profiling methodology and scalability analysis of the MetUM version 7.5 at both low-scale and high-scale atmosphere grid resolutions. Variability within the execution of the MetUM and variability in the run-time of identical jobs on a highly shared cluster are taken into account. The methodology uses a lightweight profiler internal to the MetUM which we have enhanced to have minimal overhead, enabling accurate profiling with only a relatively modest usage of processor time. At high-scale resolution, the MetUM scaled to core counts of 2048, with load imbalance accounting for a significant fraction of the loss from ideal performance. Recent patches have removed two relatively small sources of inefficiency. Tuning internal segment size parameters gave a modest performance improvement at low-scale resolution (such as is used in climate simulation); this, however, was not significant at higher scales. Near-square process grid configurations tended to give the best performance. Byte-swapping optimizations vastly improved I/O performance, which in turn has a large impact on performance in operational runs.
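
    One of the tuning points above, preferring near-square process grids, is easy to sketch (a generic helper, not MetUM code):

        # Pick the most nearly square p x q grid for a given core count
        # (generic utility, not part of MetUM).
        def near_square_grid(ncores):
            best = (1, ncores)
            for p in range(1, int(ncores ** 0.5) + 1):
                if ncores % p == 0:
                    best = (p, ncores // p)  # p approaches sqrt(ncores)
            return best

        print(near_square_grid(2048))  # -> (32, 64)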